The objective of this second assignment is to explore the differentially expressed genes from the cleaned and normalized data in assignmnent one. Then rank the thresholded over-representation analysis to highlight the top terms / dominant themes in the top set of genes. Lastly, compare my result with the original literature and find some other supports as well for my result if possible. I make some changes to my assignment one and stored the file as “ammended_A1.Rmd” and imported it here. I will demonstrate briefly what I have changed and I general workflow in A1.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("EnsDb.Hsapiens.v75", quietly = TRUE))
BiocManager::install("EnsDb.Hsapiens.v75")
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
library(limma)
library(edgeR)
library(EnsDb.Hsapiens.v75)
library(knitr)
library(ComplexHeatmap)
library(circlize)
library(gprofiler2)
My data is from GSE84054
What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.
Download data
Group samples - pre-filtered
Check GENEID duplications
Boxplot - prefiltered
Density Plot - prefiltered
MDS plot - prefiltered
Filter
Boxplot - filtered
Density plot - filtered
MDS plot - filtered
Identifier mapping
I proved in A1 that the 17 unmapped gene ids are actually duplicates. The reason why the previous check did not catch them was because if we change the “R” to “0”, they can be detected. (They are exactly same, except locating on the Y chromosome). So here I am just going to remove them.
Group samples - Filtered (SAME)
Apply normalization on expr_filtered
Loaded my normalized data from A1.
normalized_count_data <- read.table(file="GSE84054_normalized_count.txt")
kable(normalized_count_data[1:5, 1:5], type="html")
| GENEID | SYMBOL | RHB037 | RHB038 | RHB041 | |
|---|---|---|---|---|---|
| ENSG00000000003 | ENSG00000000003 | TSPAN6 | 79.809196 | 63.375225 | 37.719858 |
| ENSG00000000419 | ENSG00000000419 | DPM1 | 36.461054 | 38.451386 | 49.614654 |
| ENSG00000000457 | ENSG00000000457 | SCYL3 | 12.060195 | 12.077110 | 6.329049 |
| ENSG00000000460 | ENSG00000000460 | C1orf112 | 6.762435 | 4.972927 | 3.848316 |
| ENSG00000000938 | ENSG00000000938 | FGR | 1.402348 | 3.966502 | 1.940060 |
| ## Explore: Normal | ized | ||||
| 1. boxplot - norma | lized |
# After normalization
data2plot_after <- log2(normalized_count_data[,3:ncol(normalized_count_data)])
{boxplot(data2plot_after, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Filtered and normalized RNASeq Samples")
abline(h = median(apply(data2plot_after, 2, median)), col = "green", lwd = 0.6, lty = "dashed")}
counts_density <- apply(log2(normalized_count_data[,3:ncol(normalized_count_data)]), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM",cex.lab = 0.85,
main = "Filtered and normalized RNASeq Samples distribution")
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot), col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4", merge = TRUE, bg = "gray90")
heatmap_matrix <- normalized_count_data[,3:ncol(normalized_count_data)]
rownames(heatmap_matrix) <- normalized_count_data$GENEID
colnames(heatmap_matrix) <- rownames(samples_filtered)
# MDS plot by "cell_type" in samples
plotMDS(heatmap_matrix, labels=rownames(samples_filtered),
col = c("darkgreen","blue")[factor(samples_filtered$cell_type)],
main = "MDS plot depending on cell type")
pat_colors <- rainbow(12)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
# MDS plot by "cell_type" + "patients"in samples
plotMDS(heatmap_matrix, col = pat_colors,
main = "MDS plot depending on both cell type and patients")
Based on the two models in part1, I decide to base my model only on “cell_type”
model_design <- model.matrix(~ samples$cell_type)
kable(model_design, type="html")
| (Intercept) | samples$cell_typeSphere |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.
expressionMatrix <- as.matrix(normalized_count_data[,3:ncol(normalized_count_data)])
rownames(expressionMatrix) <- normalized_count_data$GENEID
colnames(expressionMatrix) <- colnames(normalized_count_data)[3:ncol(normalized_count_data)]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# fit
fit <- lmFit(minimalSet, model_design)
# Use Bayes
fit2 <- eBayes(fit,trend=TRUE)
# Correction: BH
topfit <- topTable(fit2, coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,1:2], topfit, by.y = 0, by.x = 1, all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
kable(output_hits[1:5,],type="html")
# number of genes that pass threshold p-value = 0.05
length(which(output_hits$P.Value < 0.05)) # 6988
# number of genes that pass correction
length(which(output_hits$adj.P.Val < 0.05)) # 4062
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE,
main = "Binomial distribution of my data")
The individual dots represent each gene and the blue line is the overall trend line.
plotBCV(d,col.tagwise = "black",col.common = "red",
main = "BCV plot of RNA-seq data")
I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. I have showed the result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well. There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.
fit <- glmQLFit(d, model_design)
qlf.sphere_vs_tumour <- glmQLFTest(fit, coef='samples$cell_typeSphere')
kable(topTags(qlf.sphere_vs_tumour), type="html")
# Get all the results
qlf_output_hits <- topTags(qlf.sphere_vs_tumour,
sort.by = "PValue",
n = nrow(normalized_count_data))
# Number of genes that pass the threshold p-value = 0.05
length(which(qlf_output_hits$table$PValue < 0.05)) # 7467
# Number of genes that pass correction
length(which(qlf_output_hits$table$FDR < 0.05)) # 5033
I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis.
# How many genes are up regulated?
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)) # 1897
# How many genes are down regulated?
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)) # 5570
# Get those up and down-regulated genes
qlf_output_hits_withgn <- merge(expr_filtered[,1:2],qlf_output_hits, by.x=1, by.y = 0)
upregulated_genes <- qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0)]
downregulated_genes <-qlf_output_hits_withgn$GENEID[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0)]
# store data
unreg_genes_copy <- data.frame(upregulated_genes)
downreg_genes_copy <- data.frame(downregulated_genes)
names(unreg_genes_copy) <- names(downreg_genes_copy)
all_de <- rbind(unreg_genes_copy, downreg_genes_copy)
colnames(all_de) <- "all_de"
write.table(x=all_de,
file="all_expr_de_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=upregulated_genes,
file="expr_upregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file="expr_downregulated_genes.txt",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from an example analysis on tomato data
volcanoData <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(volcanoData) <- c("logFC", "Pval")
up <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 0
point.col <- ifelse(up, "red", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Up-regulated genes in RNA-seq data")
down <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(down, "blue", "black")
plot(volcanoData, pch = 16, col = point.col, cex = 0.5,
main = "Down-regulated genes in RNA-seq data")
To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to note that differential expression exists. s
top_hits <- rownames(qlf_output_hits$table)[output_hits$P.Value<0.05]
heatmap_matrix_tophits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits <- heatmap_matrix_tophits[, c(grep(colnames(heatmap_matrix_tophits),pattern = "_P"),
grep(colnames(heatmap_matrix_tophits),pattern = "_S"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,)
current_heatmap
There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.
The some analysis is apply to down regulated
upregulated_genes_sym <- qlf_output_hits_withgn$SYMBOL[which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 1)]
length(upregulated_genes_sym) # 1312
upregulated_genes_sym[grep(pattern="ALDH",upregulated_genes_sym)]
# ALDH2, ALDH8A1, ALDH1L2